Analyzing my Spotify Streaming History#
Author: Noah Stemen
Course Project, UC Irvine, Math 10, Summer 2023
Introduction#
The goal of this project is to explore aspects of my personal streaming history as provided by Spotify and examine the unique variables provided by the “1 Million Tracks” dataset. After creating a new dataframe of only the tracks included in both, I will explore a few trends among the variables with visuals and use linear regression to fit a predictive line between two columns of data.
Creating & Filtering my “Streaming History” DataFrame#
import pandas as pd
# first, let's import all of our files and use "sh" for "streaming history"
sh1 = pd.read_json("StreamingHistory0.json") #10,000 rows
sh2 = pd.read_json("StreamingHistory1.json") #10,000 rows
sh3 = pd.read_json("StreamingHistory2.json") #10,000 rows
sh4 = pd.read_json("StreamingHistory3.json") #10,000 rows
sh5 = pd.read_json("StreamingHistory4.json") #1,650 rows
# next, because each file covers a different period of my streaming history, let's combine them
sh = pd.concat([sh1, sh2, sh3, sh4, sh5])
#to check it is properly combined, let's check the shape of this new dataframe - it should be 41,650 rows
sh.shape
(41650, 4)
sh
#pulling this up to get the column names
| endTime | artistName | trackName | msPlayed | |
|---|---|---|---|---|
| 0 | 2022-09-08 00:35 | Jon Bellion | All Time Low | 119820 |
| 1 | 2022-09-08 00:35 | Jon Bellion | Eyes To The Sky | 1560 |
| 2 | 2022-09-08 00:38 | Lawrence | False Alarms (with Jon Bellion) | 84710 |
| 3 | 2022-09-08 00:38 | Jon Bellion | While You Count Sheep | 1250 |
| 4 | 2022-09-08 00:39 | Jon Bellion | Blu | 58630 |
| ... | ... | ... | ... | ... |
| 1645 | 2023-09-08 23:37 | Hozier | De Selby (Part 1) | 3940 |
| 1646 | 2023-09-08 23:37 | half•alive | Nobody - Live | 220990 |
| 1647 | 2023-09-08 23:37 | half•alive | What's Wrong | 36933 |
| 1648 | 2023-09-08 23:45 | Hozier | Unknown / Nth | 280106 |
| 1649 | 2023-09-08 23:58 | Hozier | First Light | 292080 |
41650 rows Ă— 4 columns
Next, I want to filter out certain recorded tracks I don’t want included. Specifically, I frequently listen to white noise and rain sounds on Spotify to help me keep focus while getting work done. Given how many minutes of this would be recorded, and its irrelevance to an analysis of my taste in music, let’s remove it.
# first, let's find the full name of the white noise track
sh[sh['trackName'].str.contains('Noise')]
# "White Noise 2 Ho...", "White Noise 3 Ho...", and "Sleep Sounds Rai..." are all white noise tracks
# "Street Noise" & "Turn Off The Noise" are songs we will include in our work
| endTime | artistName | trackName | msPlayed | |
|---|---|---|---|---|
| 1367 | 2022-09-25 03:41 | Erik Eriksson | White Noise 2 Hour Long | 4216 |
| 1383 | 2022-09-25 04:05 | Erik Eriksson | White Noise 2 Hour Long | 560 |
| 5253 | 2022-11-01 18:33 | Erik Eriksson | White Noise 2 Hour Long | 612690 |
| 5254 | 2022-11-01 18:33 | Erik Eriksson | White Noise 2 Hour Long | 9600 |
| 5256 | 2022-11-01 19:42 | Erik Eriksson | White Noise 2 Hour Long | 2822980 |
| 5257 | 2022-11-01 19:46 | Erik Eriksson | White Noise 2 Hour Long | 255920 |
| 5258 | 2022-11-01 20:43 | Erik Eriksson | White Noise 2 Hour Long | 1380030 |
| 5259 | 2022-11-01 21:21 | Erik Eriksson | White Noise 2 Hour Long | 666990 |
| 5367 | 2022-11-02 04:23 | Erik Eriksson | White Noise 2 Hour Long | 16362 |
| 1684 | 2023-01-10 18:39 | Erik Eriksson | White Noise 2 Hour Long | 2326260 |
| 2009 | 2023-01-14 21:19 | Erik Eriksson | White Noise 3 Hour Long | 826080 |
| 2831 | 2023-01-24 18:32 | Erik Eriksson | White Noise 3 Hour Long | 5289087 |
| 2833 | 2023-01-24 19:00 | Erik Eriksson | White Noise 3 Hour Long | 1660 |
| 3012 | 2023-01-26 06:28 | Erik Eriksson | White Noise 2 Hour Long | 1443330 |
| 4503 | 2023-02-02 18:09 | Thymes | Street Noise | 1045 |
| 4504 | 2023-02-02 18:09 | Thymes | Street Noise | 30859 |
| 4505 | 2023-02-02 18:11 | Thymes | Street Noise | 83412 |
| 4758 | 2023-02-03 19:31 | Thymes | Street Noise | 114000 |
| 9490 | 2023-03-05 14:29 | Relaxing White Noise | Sleep Sounds Rain & Thunderstorm White Noise 8... | 44864 |
| 467 | 2023-03-15 15:45 | Erik Eriksson | White Noise 2 Hour Long | 28053 |
| 1121 | 2023-03-21 04:40 | Erik Eriksson | White Noise 2 Hour Long | 7200649 |
| 1122 | 2023-03-21 05:36 | Erik Eriksson | White Noise 2 Hour Long | 32340 |
| 1136 | 2023-03-21 21:45 | Erik Eriksson | White Noise 2 Hour Long | 252670 |
| 1704 | 2023-03-26 11:52 | Erik Eriksson | White Noise 2 Hour Long | 13888 |
| 4480 | 2023-04-21 04:35 | Erik Eriksson | White Noise 2 Hour Long | 1696290 |
| 4481 | 2023-04-21 04:36 | Erik Eriksson | White Noise 2 Hour Long | 1660 |
| 4510 | 2023-04-21 16:37 | Erik Eriksson | White Noise 2 Hour Long | 4806880 |
| 4842 | 2023-04-24 06:01 | Erik Eriksson | White Noise 2 Hour Long | 3530618 |
| 6083 | 2023-05-03 21:54 | Erik Eriksson | White Noise 3 Hour Long | 5515784 |
| 7228 | 2023-05-10 18:22 | Erik Eriksson | White Noise 2 Hour Long | 928 |
| 7427 | 2023-05-12 00:44 | Erik Eriksson | White Noise 2 Hour Long | 2836290 |
| 7877 | 2023-05-15 06:50 | Erik Eriksson | White Noise 2 Hour Long | 7200649 |
| 7880 | 2023-05-15 07:48 | Erik Eriksson | White Noise 2 Hour Long | 450050 |
| 237 | 2023-06-01 05:56 | Erik Eriksson | White Noise 2 Hour Long | 2410 |
| 238 | 2023-06-01 05:56 | Erik Eriksson | White Noise 2 Hour Long | 1821721 |
| 429 | 2023-06-02 01:38 | Erik Eriksson | White Noise 2 Hour Long | 2261210 |
| 1467 | 2023-06-12 06:26 | Erik Eriksson | White Noise 2 Hour Long | 5634016 |
| 5827 | 2023-07-12 05:46 | Peter McPoland | Turn Off The Noise | 231561 |
| 9019 | 2023-08-14 21:35 | Erik Eriksson | White Noise 2 Hour Long | 4834050 |
| 760 | 2023-08-30 22:47 | Erik Eriksson | White Noise 2 Hour Long | 6316280 |
| 1483 | 2023-09-06 22:55 | Erik Eriksson | White Noise 2 Hour Long | 2640683 |
| 1484 | 2023-09-06 23:45 | Erik Eriksson | White Noise 2 Hour Long | 1550920 |
sh = sh[~sh['trackName'].str.contains('White Noise|Sleep Sounds', case=False)]
# to test this worked, this should now only return the songs "Street Noise" and "Turn Off The Noise"
sh[sh['trackName'].str.contains('Noise')]
| endTime | artistName | trackName | msPlayed | |
|---|---|---|---|---|
| 4503 | 2023-02-02 18:09 | Thymes | Street Noise | 1045 |
| 4504 | 2023-02-02 18:09 | Thymes | Street Noise | 30859 |
| 4505 | 2023-02-02 18:11 | Thymes | Street Noise | 83412 |
| 4758 | 2023-02-03 19:31 | Thymes | Street Noise | 114000 |
| 5827 | 2023-07-12 05:46 | Peter McPoland | Turn Off The Noise | 231561 |
Now, let’s make sure there are no missing values in any of our columns or rows.
sh.isnull().sum()
# nope, we're all good to go :)
endTime 0
artistName 0
trackName 0
msPlayed 0
dtype: int64
We also need to convert the “endTime” column into usable datetimes instead of strings for later use. I will also create a new “minutesPlayed” column from the “msPlayed” column for more legible units.
sh["endTime"] = pd.to_datetime(sh["endTime"])
sh["minutesPlayed"] = sh["msPlayed"]/60000
#this SettingWithCopyWarning appears because "sh" is a filtered slice of the original dataframe;
#the assignments still work here, but taking an explicit copy after filtering would avoid it
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
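As a minimal sketch (with made-up column values, not my actual history), the warning can be avoided by taking an explicit `.copy()` after filtering, so later column assignments operate on an independent frame:

```python
import pandas as pd

# toy frame standing in for the streaming history (values are made up)
df = pd.DataFrame({"trackName": ["a", "b", "c"], "msPlayed": [60000, 120000, 30000]})

# filtering returns a slice of df; .copy() makes later assignments unambiguous
sub = df[df["msPlayed"] > 40000].copy()
sub["minutesPlayed"] = sub["msPlayed"] / 60000  # no SettingWithCopyWarning
print(sub["minutesPlayed"].tolist())  # [1.0, 2.0]
```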
Filtering the “1 Million Songs” DataFrame into a new sub-DataFrame#
#let's use "ms" for "million songs"
ms = pd.read_csv("spotify_data.csv")
ms.isnull().sum()
#for a dataframe of 1 million tracks, miraculously there are no missing values
Unnamed: 0 0
artist_name 0
track_name 0
track_id 0
popularity 0
year 0
genre 0
danceability 0
energy 0
key 0
loudness 0
mode 0
speechiness 0
acousticness 0
instrumentalness 0
liveness 0
valence 0
tempo 0
duration_ms 0
time_signature 0
dtype: int64
df_ms = pd.merge(ms, sh, left_on=['track_name', 'artist_name'], right_on=['trackName', 'artistName'], how='inner')
#now I do not need duplicate columns of the same information (plus one stray index column)
df_ms = df_ms.drop(['artistName', 'trackName', 'Unnamed: 0'], axis=1)
df_ms
| artist_name | track_name | track_id | popularity | year | genre | danceability | energy | key | loudness | ... | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | endTime | msPlayed | minutesPlayed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jason Mraz | I Won't Give Up | 53QF56cjZA9RTuuMZDrSA6 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | ... | 0.69400 | 0.000 | 0.115 | 0.1390 | 133.406 | 240166 | 3 | 2023-03-01 20:05:00 | 36608 | 0.610133 |
| 1 | Jason Mraz | I Won't Give Up | 53QF56cjZA9RTuuMZDrSA6 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | ... | 0.69400 | 0.000 | 0.115 | 0.1390 | 133.406 | 240166 | 3 | 2023-03-01 20:09:00 | 200447 | 3.340783 |
| 2 | Jason Mraz | I Won't Give Up | 53QF56cjZA9RTuuMZDrSA6 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | ... | 0.69400 | 0.000 | 0.115 | 0.1390 | 133.406 | 240166 | 3 | 2023-03-01 23:54:00 | 6498 | 0.108300 |
| 3 | Neon Trees | Everybody Talks | 2iUmqdfGZcHIhS3b9E9EWq | 77 | 2012 | alt-rock | 0.471 | 0.924 | 8 | -3.906 | ... | 0.00301 | 0.000 | 0.313 | 0.7250 | 154.961 | 177280 | 4 | 2022-09-10 06:39:00 | 177280 | 2.954667 |
| 4 | Neon Trees | Everybody Talks | 2iUmqdfGZcHIhS3b9E9EWq | 77 | 2012 | alt-rock | 0.471 | 0.924 | 8 | -3.906 | ... | 0.00301 | 0.000 | 0.313 | 0.7250 | 154.961 | 177280 | 4 | 2022-09-12 04:58:00 | 177280 | 2.954667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27858 | The Drums | I Don't Know How To Love | 2YvWonOJesvP0yu9IFJY7S | 61 | 2011 | rock | 0.411 | 0.890 | 11 | -6.062 | ... | 0.06170 | 0.127 | 0.227 | 0.0669 | 169.970 | 202054 | 4 | 2023-03-13 04:36:00 | 202053 | 3.367550 |
| 27859 | The Drums | Days | 6113aOfHIC0vbZVDZ6PpRV | 44 | 2011 | rock | 0.586 | 0.721 | 2 | -7.743 | ... | 0.36900 | 0.160 | 0.141 | 0.6570 | 84.987 | 269082 | 4 | 2023-03-03 19:16:00 | 269081 | 4.484683 |
| 27860 | The Drums | Days | 6113aOfHIC0vbZVDZ6PpRV | 44 | 2011 | rock | 0.586 | 0.721 | 2 | -7.743 | ... | 0.36900 | 0.160 | 0.141 | 0.6570 | 84.987 | 269082 | 4 | 2023-03-25 00:26:00 | 105290 | 1.754833 |
| 27861 | The Drums | Days | 6113aOfHIC0vbZVDZ6PpRV | 44 | 2011 | rock | 0.586 | 0.721 | 2 | -7.743 | ... | 0.36900 | 0.160 | 0.141 | 0.6570 | 84.987 | 269082 | 4 | 2023-03-28 04:55:00 | 269081 | 4.484683 |
| 27862 | The Drums | Days | 6113aOfHIC0vbZVDZ6PpRV | 44 | 2011 | rock | 0.586 | 0.721 | 2 | -7.743 | ... | 0.36900 | 0.160 | 0.141 | 0.6570 | 84.987 | 269082 | 4 | 2023-03-28 05:17:00 | 2070 | 0.034500 |
27863 rows Ă— 22 columns
I’d also like to briefly look at the start and end dates of the recorded streams, to see how long a period my data covers and whether the earliest and most recent streams changed after merging.
earliest_date = sh['endTime'].min()
latest_date = sh['endTime'].max()
earliest_date_kept = df_ms['endTime'].min()
latest_date_kept = df_ms['endTime'].max()
earliest_date
Timestamp('2022-09-08 00:35:00')
earliest_date_kept
Timestamp('2022-09-08 00:35:00')
latest_date
Timestamp('2023-09-08 23:58:00')
latest_date_kept
Timestamp('2023-09-08 23:37:00')
Out of the 41,650 streams initially recorded in my streaming history, we are left with 27,863 streams. Considering the “1 Million Tracks” data is only a sample of Spotify’s catalog, retaining 66% of what I streamed is impressive. Unfortunately, that does mean 34% of my streams will not be represented in the Altair charts or any work going forward. Also, I know that I downloaded Spotify back in 2021, so even the initial “sh” dataframe contained only streams going back one year rather than my “entire” streaming history. With all of that said, the work in the rest of the project uses this sample of my overall streaming history.
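The row loss comes from the `how='inner'` merge above: only (track, artist) pairs present in both frames survive. A minimal sketch with made-up rows:

```python
import pandas as pd

# toy catalog and toy listening history (values are made up)
catalog = pd.DataFrame({"artist_name": ["A", "B"], "track_name": ["x", "y"], "energy": [0.5, 0.9]})
history = pd.DataFrame({"artistName": ["A", "C"], "trackName": ["x", "z"], "msPlayed": [1000, 2000]})

# an inner merge keeps only rows whose (track, artist) pair appears in BOTH frames
merged = pd.merge(catalog, history,
                  left_on=["track_name", "artist_name"],
                  right_on=["trackName", "artistName"],
                  how="inner")
print(len(merged))  # 1 -- the (C, z) stream has no catalog match and is dropped
```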
Determining my Most Streamed Songs and Artists#
Next, instead of having 27,863 rows, one per stream, let’s aggregate the streams by song/artist and create two new columns: “total_times_streamed” and “total_minutes_streamed.”
most_songs = df_ms.groupby(['artist_name', 'track_name']).agg(
total_times_streamed=('minutesPlayed', 'count'),
total_minutes_streamed=('minutesPlayed', 'sum')
).reset_index()
most_songs
streams = pd.merge(most_songs, df_ms, on=['artist_name', 'track_name'], how='inner')
#I no longer need the endTime, msPlayed, or minutesPlayed columns,
#as they keep me from dropping duplicates and are already accounted for in
#the total number of streams and total time listened
streams = streams.drop(["endTime", "msPlayed", "minutesPlayed"], axis=1)
streams = streams.drop_duplicates()
streams
| artist_name | track_name | total_times_streamed | total_minutes_streamed | track_id | popularity | year | genre | danceability | energy | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | $uicideboy$ | $uicideboy$ Were Better In 2015 | 1 | 2.389100 | 6LoaYlv0bC1TyctuADqNFh | 66 | 2022 | hip-hop | 0.883 | 0.8220 | ... | -4.029 | 0 | 0.1080 | 0.0301 | 0.000002 | 0.1110 | 0.3270 | 110.024 | 143347 | 4 |
| 1 | $uicideboy$ | 1000 Blunts | 1 | 2.924600 | 09riz9pAPJyYYDVynE5xxY | 75 | 2022 | hip-hop | 0.830 | 0.6980 | ... | -6.517 | 0 | 0.0770 | 0.2240 | 0.000001 | 0.1910 | 0.5950 | 132.990 | 175476 | 4 |
| 2 | $uicideboy$ | Antarctica | 2 | 0.030167 | 5UGAXwbA17bUC0K9uquGY2 | 77 | 2016 | hip-hop | 0.715 | 0.6330 | ... | -6.869 | 1 | 0.0804 | 0.5530 | 0.000004 | 0.0905 | 0.3190 | 105.945 | 126850 | 5 |
| 4 | $uicideboy$ | Avalon | 1 | 0.266917 | 7CxFWAnQ8eqiRL4W12Xzb6 | 68 | 2021 | hip-hop | 0.877 | 0.6000 | ... | -4.577 | 1 | 0.0813 | 0.0210 | 0.000054 | 0.2440 | 0.1760 | 149.996 | 140859 | 4 |
| 5 | $uicideboy$ | For the Last Time | 5 | 7.830050 | 240audWazVjwvwh7XwfSZE | 74 | 2017 | hip-hop | 0.844 | 0.5330 | ... | -9.612 | 1 | 0.5520 | 0.0735 | 0.000003 | 0.0953 | 0.2300 | 140.078 | 156081 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27855 | soho | At Peace | 2 | 0.010433 | 7fJ1v1CninD1DsfNVbs4HU | 34 | 2018 | chill | 0.809 | 0.3040 | ... | -12.764 | 1 | 0.2180 | 0.9630 | 0.898000 | 0.1080 | 0.5270 | 79.971 | 120000 | 4 |
| 27857 | thuy | girls like me don't cry | 1 | 0.097000 | 2DtUUBwYwEzKMTMDrc5EiO | 64 | 2022 | chill | 0.871 | 0.3720 | ... | -9.077 | 0 | 0.0413 | 0.2530 | 0.000002 | 0.1040 | 0.6080 | 110.011 | 214387 | 4 |
| 27858 | thuy | universe | 1 | 0.091667 | 7B4UxdHwRKJYRhvXxmgZhM | 62 | 2021 | chill | 0.636 | 0.4520 | ... | -8.298 | 1 | 0.0329 | 0.1360 | 0.000002 | 0.1040 | 0.0678 | 80.004 | 186627 | 4 |
| 27859 | Ólafur Arnalds | Saudade (When We Are Born) | 3 | 7.500000 | 1ijwLR1iybtxaUbasUj7kJ | 59 | 2021 | ambient | 0.289 | 0.0253 | ... | -31.435 | 1 | 0.0376 | 0.9940 | 0.919000 | 0.0837 | 0.1380 | 99.801 | 150000 | 4 |
| 27862 | Ólafur Arnalds | So Far | 1 | 0.041017 | 6oVhL0lLUMswqSV3VcKwJO | 50 | 2015 | ambient | 0.462 | 0.3390 | ... | -14.301 | 0 | 0.0418 | 0.8130 | 0.808000 | 0.1080 | 0.0395 | 115.015 | 272014 | 4 |
3742 rows Ă— 21 columns
Finally, I have the final form of my dataset: my streaming history, the time spent listening to each artist or song, and each unique attribute Spotify records for each unique song. To check that this final step was done correctly, the sum of “total_times_streamed” should equal the number of rows in df_ms (27,863).
streams["total_times_streamed"].sum()
27863
Now that we have this final form of my data, let’s determine my most streamed artists and songs that exist within both my streaming history and the dataset of 1 million songs.
# five most streamed songs
top_tracks = streams.groupby('track_name')['total_minutes_streamed'].sum()
# Sort the tracks by total_minutes_streamed in descending order and select the top 5
# and round it so there aren't any endless decimals
top_tracks.sort_values(ascending=False).head(5).round(3)
track_name
Stories 795.002
Stick Season 715.914
One More Time 666.362
Ode to a Conversation Stuck in Your Throat 645.537
All My Love 607.431
Name: total_minutes_streamed, dtype: float64
# five most streamed artists
top_artists = streams.groupby('artist_name')['total_minutes_streamed'].sum()
# Sort the tracks by total_minutes_streamed in descending order and select the top 5
# and round it so there aren't any endless decimals
top_artists.sort_values(ascending=False).head(5).round(3)
artist_name
Noah Kahan 5857.644
Paramore 3033.930
Taylor Swift 2595.682
Hippo Campus 1629.579
Tyler, The Creator 1564.350
Name: total_minutes_streamed, dtype: float64
According to my filtered “streams” dataset, my five most streamed songs of the past year are “Stories”, “Stick Season”, “One More Time”, “Ode to a Conversation Stuck in Your Throat”, and “All My Love”, and my five most streamed artists are Noah Kahan, Paramore, Taylor Swift, Hippo Campus, and Tyler, The Creator. While a large number of streams were lost in filtering against the “1 Million Songs” dataset, I can tell you with confidence that these results are very representative of my taste in music.
Visualizing My Taste in Music#
import altair as alt
This scatterplot shows the entirety of “streams” by energy and popularity; the darker colors signify where my most streamed songs lie relative to the rest.
alt.Chart(streams).mark_circle().encode(
x='popularity:Q',
y='energy:Q',
color=alt.Color("total_minutes_streamed:Q", scale=alt.Scale(scheme="goldorange")),
tooltip=["artist_name","track_name","total_minutes_streamed", "total_times_streamed","genre"],
)
Unfortunately, that makes it hard to see the rest of my overall taste in music. So, let’s exclude the most streamed songs (those above about 300 minutes) as outliers. I will also cut the number of songs in half so that hovering over individual points is more precise.
df_streams2 = streams[streams["total_minutes_streamed"] < 300]
df_streams = df_streams2.sample(frac=0.5, random_state=76)
alt.Chart(df_streams).mark_circle().encode(
x='popularity:Q',
y='energy:Q',
color=alt.Color("total_minutes_streamed:Q", scale=alt.Scale(scheme="goldorange")),
tooltip=["artist_name","track_name","total_minutes_streamed", "total_times_streamed","genre"],
)
Now that I can see more of the higher end of “total minutes streamed”, I can see that the majority of my streams fall between 40 and 80 on the popularity scale, but are far more spread out across energy. This tells me that while I am more likely to listen to songs that are “heard of but not too popular,” I will listen to a wide range of energy levels, while still leaning toward higher-energy music. Next, I want to see how the release year plays into my taste in music.
yearly_minutes = streams.groupby('year')['total_minutes_streamed'].sum().round(2).reset_index()
chart = alt.Chart(yearly_minutes).mark_bar().encode(
x='year:N',
y='total_minutes_streamed:Q',
tooltip=['year:N', 'total_minutes_streamed:Q']
)
chart
This suggests to me that there is a quirk in the dataset I did not anticipate, but now somewhat understand. The “year” variable certainly does not represent the year originally released, as I had assumed. It possibly represents the year the song was added to (or last updated/remastered on) Spotify. But even that makes little sense, as Spotify was created in 2006, yet the oldest year listed is 2000. Whatever the true meaning of “year”, if not the actual release year, this chart shows my preference for music made within the last 6-7 years. In the last chart, I would like to examine which recorded genres I listen to more than others.
streams_df_genre = streams.sort_values(by='total_minutes_streamed', ascending=False)
chart = alt.Chart(streams_df_genre).mark_bar().encode(
x=alt.X('genre:N', sort='y'),
y='total_minutes_streamed:Q',
tooltip=['genre:N', 'total_minutes_streamed:Q']
)
chart
While I was not expecting “electro” to be my most streamed genre, it does make sense for it to be up there. The rest makes sense to me, as genres like pop, rock, and indie-pop are broad enough to cover plenty of the songs I listen to.
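Incidentally, the bar chart above stacks one segment per song within each genre bar; aggregating per genre up front yields the same totals as a single bar each. A minimal sketch with made-up values:

```python
import pandas as pd

# toy stand-in for the "streams" dataframe (values are made up)
streams = pd.DataFrame({
    "genre": ["pop", "rock", "pop"],
    "total_minutes_streamed": [10.0, 5.0, 2.5],
})

# summing per genre explicitly gives one row (and thus one bar) per genre
genre_totals = streams.groupby("genre", as_index=False)["total_minutes_streamed"].sum()
print(genre_totals)  # pop: 12.5, rock: 5.0
```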
Machine Learning#
Can I use Linear Regression to predict a trend between two variables given? For example, in my selection in music, how do loudness and energy interact?
from sklearn.linear_model import LinearRegression
X = streams[['energy']]
y = streams['loudness']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
streams['predictions'] = predictions
scatter_plot = alt.Chart(streams).mark_circle().encode(
x='energy:Q',
y='loudness:Q',
tooltip=['energy:Q', 'loudness:Q'],
)
best_fit_line = alt.Chart(streams).mark_line(color='red').encode(
x='energy:Q',
y='predictions:Q',
)
combined_chart = scatter_plot + best_fit_line
combined_chart
I used scikit-learn to create a LinearRegression model, fit it to the data, and calculate predictions to add as a new “predictions” column to my “streams” dataframe. I then created an Altair scatterplot with my streams and a best fit line with the predictions, combining the two into one chart.
Unsurprisingly, there is a positive correlation between “loudness” and “energy.” What is interesting, however, is that the best-fit line cannot capture the “dropoff” in energy as loudness decreases. The true shape of the relationship between “loudness” and “energy” is closer to a square-root curve, so the linear best-fit line fails to properly predict songs with loudness values below around -15.
Let’s see if we can make a better fitting line with Polynomial Regression (to the second degree).
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2) # You can change the degree as needed
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
y_pred = poly_reg.predict(X_poly)
streams['predictions_poly'] = y_pred
chart = alt.Chart(streams).mark_circle().encode(
x='energy:Q',
y='loudness:Q',
tooltip=['energy:Q', 'loudness:Q']
)
# Overlay the polynomial regression line on the chart
regression_line = alt.Chart(streams).mark_line(color='red').encode(
x='energy:Q',
y='predictions_poly:Q'
)
chart + regression_line
As you can see, changing to a second-degree polynomial now better represents the curve made as both energy and loudness decrease. If I were to increase the degree any further, the added curvature would fight the true nature of the relationship between loudness and energy and would result in overfitting. The argument can also be made that the issue has now reversed: instead of predictions being too high for loudness values below -15, the new polynomial regression curve fails to capture the more linear, upward relationship between the two around (energy = 0.2, loudness = -15).
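To quantify the improvement, one option is to compare training R² scores of the linear and quadratic fits. A sketch on synthetic data (a made-up square-root-shaped relationship, not my actual streams):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in for the energy/loudness relationship (values are made up)
rng = np.random.default_rng(76)
energy = rng.uniform(0.01, 1.0, 300).reshape(-1, 1)
loudness = -30 + 25 * np.sqrt(energy.ravel()) + rng.normal(0, 1.0, 300)

# linear fit
lin = LinearRegression().fit(energy, loudness)
r2_lin = r2_score(loudness, lin.predict(energy))

# quadratic fit on the same data
X_poly = PolynomialFeatures(degree=2).fit_transform(energy)
quad = LinearRegression().fit(X_poly, loudness)
r2_quad = r2_score(loudness, quad.predict(X_poly))

# on training data the quadratic fit can never score worse than the linear one,
# since the quadratic features contain the linear ones as a subset
print(round(r2_lin, 3), round(r2_quad, 3))
```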
Summary#
First, I created a dataframe of my own out of two pre-existing dataframes and determined certain maximums and minimums pertaining to my music. Then, I visualized different recorded traits/columns and how they relate to one another. Finally, I used linear and polynomial regression to fit best-fit lines predicting the relationship between two of those traits/columns.
References#
What is the source of your dataset(s)?
– My “Streaming History” comes directly from Spotify
– The “1 Million Tracks” dataset: https://www.kaggle.com/datasets/amitanshjoshi/spotify-1million-tracks

List any other references that you found helpful.
– How I used the “merge” feature: https://www.w3schools.com/python/pandas/ref_df_merge.asp
– Learning more about the “.agg” tool: https://stackoverflow.com/questions/38174155/group-dataframe-and-get-sum-and-count
– Otherwise, I spent most of my time looking over lecture code for reference.